Fix and rework GPT-TF.js #807
base: develop
Conversation
Force-pushed from 1d88d35 to 03e5c7d
superb work, thanks for clearing up the GPT's mud, every comment makes it more understandable!
yeah, sadly, as I forgot to merge the processing PR (#781) before you branched off, the whole processing pipeline changed a lot. sorry for stepping on your toes (hopefully, it will simplify this PR).
btw, it seems that @xenova/transformers has recently been updated to @huggingface/transformers. did you try it out? maybe it'll help with the tokenizer usage (doesn't look much changed to me but you know best)
discojs-node/src/loaders/text.ts
Outdated
const tokens = models.tokenize(tokenizer, endOfPreviousChunk + chunk, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
if (tokens.length < blockSize + 1) {
  // throw if it happens on the 1st iteration
  if (iteration === 0)
    throw new Error(`the chunk (${tokens.length} tokens) is too small ` +
      `to get a sequence of length blockSize (${blockSize + 1} tokens). ` +
      `Either the text file or the chunk size (${chunkBitSize} bits) is too small.`);
  // if this isn't the first iteration we simply skip
  // as we expect the last chunk to be potentially smaller than the block size
  debug("chunk smaller than block size, loading next chunk")
  continue
}
debug("batch per chunk: %o", tokens.length / (batchSize * blockSize))
let currentPosition = 0;
// yield one block of tokens at a time
while (currentPosition + blockSize + 1 <= tokens.length) {
  yield tokens.slice(currentPosition, currentPosition + blockSize + 1);
  currentPosition += blockSize; // don't add + 1 here
}
// keep the last tokens for the next chunk
// if this was the last one the remaining tokens are discarded
if (currentPosition < tokens.length) {
  // We actually need to decode the tokens to get the leftover text
  // instead of simply keeping the remaining tokens.
  // This is because the tokens may be different once prepended to the next chunk,
  // e.g. if the remaining text is ". A" and the next chunk starts with "nother",
  // the tokenization will be different than if we simply concatenate the remaining tokens.
  endOfPreviousChunk = tokenizer.decode(
    tokens.slice(currentPosition),
    { skip_special_tokens: true }
  )
  debug("End of chunk, remaining text: '%s'", endOfPreviousChunk)
} else {
  // Note that the difference between tokenizing and then concatenating
  // vs concatenating and then tokenizing can happen if there is no
  // remaining text. We consider this difference negligible.
  endOfPreviousChunk = "";
}
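As a side note, here is a minimal sketch of the boundary effect described in the comments above, i.e. why the leftover is decoded back to text and re-tokenized together with the next chunk. It assumes the @xenova/transformers tokenizer API and the "Xenova/gpt2" checkpoint purely for illustration; exact token ids depend on the tokenizer used.

import { AutoTokenizer } from "@xenova/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt2");

const leftover = ". A";               // text left over from the previous chunk
const nextChunk = "nother sentence."; // start of the next chunk

// tokenizing the two pieces separately and concatenating the ids...
const separate = [...tokenizer.encode(leftover), ...tokenizer.encode(nextChunk)];
// ...generally differs from tokenizing the concatenated text,
// because "Another" can be merged into different subwords at the boundary
const together = tokenizer.encode(leftover + nextChunk);

console.log(separate, together); // expect different ids around the chunk boundary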
duplicated in discojs-web/loaders/text. that hints to me that it shouldn't happen in the loader but should be applied after.
the issue at hand is that lines were outputted by the previous version. I think we can change it to output characters (single-letter strings) instead. that would also drop the blockSize, batchSize & minChunkSize arguments, which aren't really relevant for reading text (separation of concerns and all that).
in the newly merged processing PR (#781), it is much simpler to combine such transformations. I think something like
loadText($path).batch($blockSize).map((block) => tokenize(block, $tokenizer))
with tokenize updated to accept a block / List<string> instead, and maybe drop the padding (but what would be the behavior at the end of the file?)
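For illustration, a minimal sketch of what such an overload could look like. The name tokenizeBlock is hypothetical, and it assumes an Immutable.js List of single-character strings (as produced by loadText($path).batch($blockSize)) and the @xenova/transformers tokenizer; it is not the actual Disco API.

import { List } from "immutable";
import type { PreTrainedTokenizer } from "@xenova/transformers";

// hypothetical overload: accept a block of single-character strings
function tokenizeBlock(
  block: List<string>,
  tokenizer: PreTrainedTokenizer,
): List<number> {
  // reassemble the characters into text before tokenizing,
  // so the tokenizer sees whole words rather than isolated letters
  const text = block.join("");
  return List(tokenizer.encode(text));
}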
Alright, the current implementation ended up being much simpler than my original tokenizing text loaders:
loadText('../datasets/wikitext/wiki.train.tokens')
  .map(text => processing.tokenize(tokenizer, text)) // tokenize the whole chunk
  .flat() // (I renamed unbatch to flat)
  .batchWithOverlap(config.blockSize) // one token overlap between each batch
  .map((tokens) => [tokens.pop(), tokens.last()] as [List<number>, number])
  .batch(batchSize);
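For intuition, a standalone sketch of the one-token overlap that batchWithOverlap provides, using a plain array and a hypothetical helper rather than the actual Dataset API: each window holds blockSize + 1 tokens and starts on the last token of the previous window, so the pipeline above can split it into blockSize input tokens plus the following token as the label.

// hypothetical helper mirroring the one-token overlap between windows
function windowsWithOverlap(tokens: number[], blockSize: number): number[][] {
  const windows: number[][] = [];
  // each window spans blockSize + 1 tokens; advancing by blockSize
  // makes consecutive windows share exactly one token
  for (let start = 0; start + blockSize + 1 <= tokens.length; start += blockSize) {
    windows.push(tokens.slice(start, start + blockSize + 1));
  }
  return windows;
}

// e.g. with blockSize = 4:
// windowsWithOverlap([0, 1, 2, 3, 4, 5, 6, 7, 8], 4)
// -> [[0, 1, 2, 3, 4], [4, 5, 6, 7, 8]]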
Force-pushed from 0a28b81 to f7f96dc
maybe rename block size to context len, that would be more specific
…odeling head and attention bias
… implement topk sampling
…ers following GPT2 convention, use LogLayer
… and language modeling head
…red task parameter for text tasks
…he model init config
Force-pushed from f7f96dc to 6dc5c51
discojs/src/dataset: implement and test repeat and batchWithOverlap
Force-pushed from 6dc5c51 to c477bb3
Force-pushed from 68d957b to 8cbc96e
Closes #654